The dataset is extracted from a large database of 76 variables, of which only 15 are used here. The objective is to detect the presence of heart disease in a patient. In the original data the target field is integer-valued from 0 (no presence) to 4, but experiments with the database have concentrated on simply distinguishing presence (values 1, 2, 3, 4) from absence (value 0). We will therefore use this dataset to compare the efficiency of different classification algorithms. The label is the "target" variable, which contains only two values: 0 and 1.
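The collapse from the 0-4 severity field to a binary label is a simple threshold. A minimal sketch with hypothetical severity values:

```python
import numpy as np

# Hypothetical severity values as coded in the original database (0-4)
severity = np.array([0, 2, 1, 0, 4, 3])

# Presence (1) vs. absence (0): any non-zero severity counts as presence
target = (severity > 0).astype(int)
print(target)  # [0 1 1 0 1 1]
```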
15 attributes used:
Input variables:
Predicted Variable:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import time
#Load the data
heart = pd.read_csv("D:/Business Analytics Program/Courses/Applied Machine Learning/datasets/heart.csv",)
heart.head(5)
#Review the data
#heart = heart.drop(['patientid','fbs','restecg','exang','ca','slope','sex','cp','thal'],axis =1)
heart = heart.drop(['patientid'],axis =1)
print(heart.dtypes)
rows, columns = heart.shape
print("Rows:", rows)
print("Columns:", columns)
In total, the dataset includes 302 patients (datapoints). The "patientid" column is dropped here because it carries no predictive meaning.
#check if any variables containing null values
heart.isnull().any()
We can see that no attribute contains null values.
#Use pairs plot illustrate the data
sns.pairplot(heart.iloc[:,7:15])
Here, I built pair plots showing the relationship between each pair of variables among "thalach", "exang", "oldpeak", "slope", "ca", "thal" and "target".
The graphs in the last row clearly show the two classes of the "target" variable.
#Split the data set in predictors and predicted
feature_cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
X = heart[feature_cols] # Features
y = heart.target # Target variable
To evaluate the consistency of the results and build learning curves for each algorithm, the testing size will be swept from 0.05 to 0.95 (the testing size can be neither 0 nor 1) in steps of 0.05, giving a total of 19 testing sizes. Each algorithm will be run three times, once for each random state from 0 to 2.
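The grid described above can be generated up front; a small sketch of the arithmetic (the loops below instead increment j by 0.05 inside a while loop):

```python
import numpy as np

# 19 testing sizes: 0.05, 0.10, ..., 0.95 (0 and 1 are excluded)
test_sizes = np.round(np.arange(1, 20) * 0.05, 2)
random_states = range(3)  # three runs: random_state 0, 1, 2

print(len(test_sizes))                # 19
print(test_sizes[0], test_sizes[-1])  # 0.05 0.95
```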
Important parameters for Logistic Regression in scikit learn:
penalty: {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
multi_class: {'auto', 'ovr', 'multinomial'}, default='auto'
The dataset is binary, so the multi_class parameter will be set to "ovr" (one-versus-rest); if "auto" is chosen, it resolves to "ovr" automatically for a binary problem.
The 'lbfgs' and 'newton-cg' solvers support the l2 penalty (or none), which specifies the norm used in the penalization; the 'sag' solver likewise supports only l2 (or none).
The 'saga' solver supports every kind of penalty. When penalty = 'elasticnet' is chosen with an l1_ratio between 0 and 1, the norm becomes a combination of l1 and l2.
The 'liblinear' solver can handle the l1 and l2 penalties, but it does not support penalty = 'none'.
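The solver/penalty compatibility above can be summarized in a small lookup table (a sketch based on the scikit-learn documentation; the helper name is my own):

```python
# Penalties supported by each solver, per the scikit-learn docs
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", "none"},
    "newton-cg": {"l2", "none"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
    "liblinear": {"l1", "l2"},
}

def penalty_supported(solver, penalty):
    """Return True if the given solver supports the given penalty."""
    return penalty in SUPPORTED_PENALTIES.get(solver, set())

print(penalty_supported("saga", "elasticnet"))  # True
print(penalty_supported("liblinear", "none"))   # False
```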
# Approach 1: solver = lbfgs (default)
lbfgs_train_accuracy = np.zeros([3,19])
lbfgs_test_accuracy = np.zeros([3,19])
lbfgs_time = np.zeros([3,19])
lbfgs_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l2', solver='lbfgs')
        LogisticReg.fit(X_train, y_train)
        y_pred_Train = LogisticReg.predict(X_train)
        y_pred_Test = LogisticReg.predict(X_test)
        End_time = time.time() #Saving current time
        lbfgs_size[i,k] = j  # store the testing size in the 2D array
        lb1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        lbfgs_train_accuracy[i,k] = lb1
        lb2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        lbfgs_test_accuracy[i,k] = lb2
        lb3 = float(round(End_time - Start_time, 6))
        lbfgs_time[i,k] = lb3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
In the results, we found that this model runs into a problem with the maximum number of iterations (a convergence warning). One possible solution is scaling the data, so I will apply this method to the model.
The solvers 'newton-cg', 'sag' and 'saga' face the same problem and also demand scaling the data before running the algorithm. Therefore, I use the scaled features instead of the original X_train with these solvers.
The 'liblinear' solver is the only one that does not face this problem.
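One caveat: `preprocessing.scale` standardizes the training and testing sets independently, each with its own mean and variance. A common alternative is to compute the statistics on the training set only and reuse them on the test set (scikit-learn's `StandardScaler` works this way). A minimal NumPy sketch of that alternative:

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    return mu, sigma

def standardize(X, mu, sigma):
    """Apply the training-set statistics to any split."""
    return (X - mu) / sigma

X_tr = np.array([[1.0, 10.0], [3.0, 30.0]])
X_te = np.array([[2.0, 20.0]])
mu, sigma = fit_standardizer(X_tr)
print(standardize(X_te, mu, sigma))  # [[0. 0.]]
```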
from sklearn import preprocessing
# Approach 1: solver = lbfgs (default), rerun with scaled data
lbfgs_train_accuracy = np.zeros([3,19])
lbfgs_test_accuracy = np.zeros([3,19])
lbfgs_time = np.zeros([3,19])
lbfgs_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l2', solver='lbfgs')
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        lbfgs_size[i,k] = j
        lb1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        lbfgs_train_accuracy[i,k] = lb1
        lb2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        lbfgs_test_accuracy[i,k] = lb2
        lb3 = float(round(End_time - Start_time, 6))
        lbfgs_time[i,k] = lb3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
# plot the results:
x = lbfgs_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, lbfgs_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, lbfgs_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Lbfgs solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The training set accuracy is high in all the random states, often above 85%, and it tends to increase.
The testing set accuracy fluctuates more in random states 0 and 1, where it mostly ranges between 75% and 85%. In random state 2, the testing accuracy gradually falls from a peak of nearly 95%.
# plot the results:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, lbfgs_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Lbfgs solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The running time of the algorithm varies considerably with the testing size in all the random states.
Overall, as the testing size increases, the training set accuracy rises but the testing set accuracy tends to decrease. It is best to choose the testing size at which the testing set accuracy begins to decrease, i.e. where the model starts to overfit. Running time can also be considered when the accuracy values obtained at different testing sizes are similar.
Inferred from the graphs: in random_state 0 it is best to choose a testing size of 20%; the figures for random_state 1 and random_state 2 are 10% and 5% respectively.
# Approach 2: solver = sag
sag_train_accuracy = np.zeros([3,19])
sag_test_accuracy = np.zeros([3,19])
sag_time = np.zeros([3,19])
sag_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l2', solver='sag')
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        sag_size[i,k] = j
        sag1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        sag_train_accuracy[i,k] = sag1
        sag2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        sag_test_accuracy[i,k] = sag2
        sag3 = float(round(End_time - Start_time, 6))
        sag_time[i,k] = sag3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
# plot the results:
x = sag_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, sag_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, sag_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Sag solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
There is a great similarity between the results obtained with 'sag' and those obtained with 'lbfgs'.
The training set accuracy is also high in all the random states, often more than 85% and it tends to increase.
The testing set accuracy also seems to fluctuate more greatly in random_state 0 and 1. In these two states, this number mostly ranges between 75% and 85%.
# plot the results:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, sag_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Sag solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
However, the running time differs considerably across the random states as the testing size changes.
The best testing size for each state is similar to the numbers chosen for 'lbfgs'.
In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.
Next, we turn to the 'saga' solver. With 'saga' we can choose penalty = 'elasticnet' and set l1_ratio between 0 and 1: l1_ratio = 0 makes the elastic net equivalent to the pure 'l2' penalty, l1_ratio = 1 makes it equivalent to pure 'l1', and 0 < l1_ratio < 1 gives a combination of the 'l1' and 'l2' penalties. Here we can find the best l1_ratio in three different random states.
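For reference, scikit-learn's convention is that l1_ratio = 1 corresponds to pure l1 and l1_ratio = 0 to pure l2. A sketch of the combined norm under that convention (the function name is my own):

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2."""
    w = np.asarray(w, dtype=float)
    return l1_ratio * np.abs(w).sum() + 0.5 * (1.0 - l1_ratio) * (w ** 2).sum()

w = [1.0, -2.0]
print(elastic_net_penalty(w, 1.0))  # 3.0  -> pure l1 norm
print(elastic_net_penalty(w, 0.0))  # 2.5  -> pure (halved, squared) l2 norm
```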
# Approach 3: solver = saga
# penalty = elasticnet
# find the best l1_ratio
# choose the common testing size = 0.2
saga_train_accuracy = np.zeros([3,21])
saga_test_accuracy = np.zeros([3,21])
saga_time = np.zeros([3,21])
saga_ratio = np.zeros([3,21])
j = 0.0
k = 0
for i in range(3):
    while j <= 1.0:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=j)
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        saga_ratio[i,k] = j
        saga1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        saga_train_accuracy[i,k] = saga1
        saga2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        saga_test_accuracy[i,k] = saga2
        saga3 = float(round(End_time - Start_time, 6))
        saga_time[i,k] = saga3
        j = round(j + 0.05, 2)  # round the accumulator so j can reach exactly 1.0
        k += 1
    j = 0.0
    k = 0
Here is where I ran into trouble with my code, Professor: with a plain `j += 0.05`, I could not get l1_ratio to reach the value 1.
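The cause is floating-point accumulation: 0.05 is not exactly representable in binary, so repeated additions drift slightly above 1.0 and the final iteration is skipped. Rounding the accumulator, or building the grid with `np.linspace`, avoids this; a sketch:

```python
import numpy as np

# Rounding after each addition removes the accumulated drift
j, grid = 0.0, []
while j <= 1.0:
    grid.append(j)
    j = round(j + 0.05, 2)
print(len(grid), grid[-1])  # 21 1.0

# Equivalently, generate the ratios without any accumulation at all
ratios = np.linspace(0.0, 1.0, 21)
print(ratios[-1])  # 1.0
```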
#plot the results with saga
x = saga_ratio[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, saga_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, saga_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('l1_ratio')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Saga solver with test size =0.2_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
In these three random states, the testing set accuracy is stable, so it is sensible to pick an l1_ratio that yields a high training set accuracy. Inferred from the three graphs, an l1_ratio between 0.25 and 0.3 works well in all the random states. Therefore, I pick l1_ratio = 0.25.
# plot the results with saga:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, saga_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('l1_ratio')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Saga solver with test size =0.2_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The running time fluctuates in all three random states. We can see that with l1_ratio = 0.25, the running time of the algorithm is lowest in two random states (0 and 1).
Having settled on l1_ratio = 0.25, we now find the best testing size to go with it.
# Approach 3: solver = saga
# penalty = elasticnet
# l1_ratio = 0.25
# find the best testing size
saga_train_accuracy = np.zeros([3,19])
saga_test_accuracy = np.zeros([3,19])
saga_time = np.zeros([3,19])
saga_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.25)
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        saga_size[i,k] = j
        saga1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        saga_train_accuracy[i,k] = saga1
        saga2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        saga_test_accuracy[i,k] = saga2
        saga3 = float(round(End_time - Start_time, 6))
        saga_time[i,k] = saga3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
#plot the results with saga
x = saga_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, saga_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, saga_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Saga solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
These graphs should remind us of those with 'lbfgs' and 'sag'.
The training set accuracy is also high in all the random states, often more than 85% and it tends to increase.
The testing set accuracy also seems to fluctuate more greatly in random_state 0 and 1. In these two states, this number mostly ranges between 75% and 85%.
# plot the results with saga:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, saga_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Saga solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The lowest running time is observed in random state 0; the figures for the two other random states vary.
Overall, the best testing size:
In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.
# Approach 4: solver = newton-cg
newton_train_accuracy = np.zeros([3,19])
newton_test_accuracy = np.zeros([3,19])
newton_time = np.zeros([3,19])
newton_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l2', solver='newton-cg')
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        newton_size[i,k] = j
        newton1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        newton_train_accuracy[i,k] = newton1
        newton2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        newton_test_accuracy[i,k] = newton2
        newton3 = float(round(End_time - Start_time, 6))
        newton_time[i,k] = newton3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
#plot the results with newton_cg
x = newton_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, newton_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, newton_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Newton_cg solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
Using the 'newton-cg' solver also gives results similar to the three previous options.
The training set accuracy is also high in all the random states, often more than 85% and it tends to increase.
The testing set accuracy also seems to fluctuate more greatly in random_state 0 and 1. In these two states, this number mostly ranges between 75% and 85%.
# plot the results with newton_cg:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, newton_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Newton_cg solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The lowest running time is observed in random state 0; the figures for the two other random states vary.
Overall, the best testing size for 'newton_cg' solver:
In random_state_0: it is best to choose testing size 20%. The figures for random_state_1 and random_state_2 are 10% and 5% respectively.
Now we come to the 'liblinear' option. The 'liblinear' solver is the only one that does not face the convergence problem of the other solvers. Therefore, here we can run the algorithm with 'liblinear' on both the original training data and the scaled data.
# Approach 5: solver = liblinear
# Using the original training and testing dataset
# for solver = 'liblinear', it is useful to set fit_intercept=True and increase intercept_scaling
# set intercept_scaling = 10
liblinearor_train_accuracy = np.zeros([3,19])
liblinearor_test_accuracy = np.zeros([3,19])
liblinearor_time = np.zeros([3,19])
liblinearor_size = np.zeros([3,19])
j = 0.05
k = 0
for i in range(3):
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l1', solver='liblinear', fit_intercept=True, intercept_scaling=10)
        LogisticReg.fit(X_train, y_train)
        y_pred_Train = LogisticReg.predict(X_train)
        y_pred_Test = LogisticReg.predict(X_test)
        End_time = time.time() #Saving current time
        liblinearor_size[i,k] = j
        liblinearor1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        liblinearor_train_accuracy[i,k] = liblinearor1
        liblinearor2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        liblinearor_test_accuracy[i,k] = liblinearor2
        liblinearor3 = float(round(End_time - Start_time, 6))
        liblinearor_time[i,k] = liblinearor3
        j += 0.05
        k += 1
    j = 0.05
    k = 0
#plot the results with liblinear_or
x = liblinearor_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, liblinearor_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, liblinearor_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Liblinear_or solver Model_random state_%i' % i) # liblinear_or means liblinear with the original dataset
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
All the accuracy scores look good (most training set accuracy values are above 85%). Among the three random states, random state 1 shows a steady increase in the training set accuracy.
# plot the results with liblinear_or:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, liblinearor_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Liblinear_or solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The running time also fluctuates when the testing size changes among the three random states.
For the 'liblinear' solver using the original data:
it is best to choose testing size about 5% in random state 0. The figures for random state 1 and random state 2 are 10% and 5% respectively.
# Approach 6: solver = liblinear
# Using the scaled dataset
liblinearsc_train_accuracy = np.zeros([3,19])
liblinearsc_test_accuracy = np.zeros([3,19])
liblinearsc_time = np.zeros([3,19])
liblinearsc_size = np.zeros([3,19])
for i in range(3):
    j = 0.05
    k = 0
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        X_train_scaled = preprocessing.scale(X_train) # scale the dataset
        X_test_scaled = preprocessing.scale(X_test)
        Start_time = time.time() #Saving current time
        LogisticReg = LogisticRegression(penalty='l1', solver='liblinear', fit_intercept=True, intercept_scaling=10)
        LogisticReg.fit(X_train_scaled, y_train)
        y_pred_Train = LogisticReg.predict(X_train_scaled)
        y_pred_Test = LogisticReg.predict(X_test_scaled)
        End_time = time.time() #Saving current time
        liblinearsc_size[i,k] = j
        liblinearsc1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        liblinearsc_train_accuracy[i,k] = liblinearsc1
        liblinearsc2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        liblinearsc_test_accuracy[i,k] = liblinearsc2
        liblinearsc3 = float(round(End_time - Start_time, 6))
        liblinearsc_time[i,k] = liblinearsc3
        j += 0.05
        k += 1
#plot the results with liblinear_sc
x = liblinearsc_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, liblinearsc_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, liblinearsc_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Liblinear_sc solver Model_random state_%i' % i) # liblinear_sc means liblinear with the scaled dataset
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The accuracy scores also look good. There is a steady increase in the training set accuracy in random states 1 and 2.
# plot the results with liblinear_sc:
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, liblinearsc_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Liblinear_sc solver Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The running time also fluctuates considerably with the testing size in all three random states.
For the 'liblinear' solver using the scaled data:
it is best to choose a testing size of 20% in random state 0. The figures for random states 1 and 2 are 10% and 5% respectively.
Now I will create graphs featuring all the solver options with different testing size in 3 random states.
# Create a comprehensive training accuracy plot:
x = liblinearsc_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, lbfgs_train_accuracy[i,], label='lbfgs', color='navy', marker='*', markersize=8)
    ax[i].plot(x, sag_train_accuracy[i,], label='sag', color='green', marker='*', markersize=8)
    ax[i].plot(x, saga_train_accuracy[i,], label='saga', color='red', marker='*', markersize=8)
    ax[i].plot(x, newton_train_accuracy[i,], label='newton_cg', color='magenta', marker='*', markersize=8)
    ax[i].plot(x, liblinearor_train_accuracy[i,], label='liblinear_or', color='yellow', marker='*', markersize=8)
    ax[i].plot(x, liblinearsc_train_accuracy[i,], label='liblinear_sc', color='cyan', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Algorithm Training Accuracy_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
It is clear that 'newton_cg' and 'liblinear_sc' stand out in these graphs. However, remember that 'lbfgs', 'sag', 'saga' and 'newton_cg' give highly similar training set accuracies, so 'newton_cg', being plotted after the other three, is likely to hide them.
# Create a comprehensive running time plot:
x = liblinearsc_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, lbfgs_time[i,], label='lbfgs', color='navy', marker='*', markersize=8)
    ax[i].plot(x, sag_time[i,], label='sag', color='green', marker='*', markersize=8)
    ax[i].plot(x, saga_time[i,], label='saga', color='red', marker='*', markersize=8)
    ax[i].plot(x, newton_time[i,], label='newton_cg', color='magenta', marker='*', markersize=8)
    ax[i].plot(x, liblinearor_time[i,], label='liblinear_or', color='yellow', marker='*', markersize=8)
    ax[i].plot(x, liblinearsc_time[i,], label='liblinear_sc', color='cyan', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Algorithm Running time_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
Across all three runs, the 'liblinear_sc' option yields the fastest algorithm.
In conclusion, for the heart dataset all the solver options give good results (high training accuracy and low running time). However, using the 'liblinear' solver with the scaled dataset is the best choice because it is the fastest logistic regression option. The most suitable testing size for 'liblinear' differs from run to run.
from sklearn import naive_bayes #Naive Bayes
from sklearn.metrics import ConfusionMatrixDisplay
# Approach 1: Naive Bayes Gaussian
# Find the best testing size for this algorithm and run it three times to validate the consistency of the results.
NB_train_accuracy = np.zeros([3,19])
NB_test_accuracy = np.zeros([3,19])
NB_time = np.zeros([3,19])
NB_size = np.zeros([3,19])
for i in range(3):
    j = 0.05
    k = 0
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        Start_time = time.time() # time both fitting and prediction, as with the other models
        NBayes = naive_bayes.GaussianNB()
        NBayes.fit(X_train, y_train)
        y_pred_Train = NBayes.predict(X_train)
        y_pred_Test = NBayes.predict(X_test)
        End_time = time.time() #Saving current time
        NB_size[i,k] = j
        NB1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        NB_train_accuracy[i,k] = NB1
        NB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        NB_test_accuracy[i,k] = NB2
        NB3 = float(round(End_time - Start_time, 6))
        NB_time[i,k] = NB3
        j += 0.05
        k += 1
#plot the results with Naive Bayes Gaussian
x = NB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, NB_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, NB_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Gaussian Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The training accuracy scores look good, with high values (mostly above 80%). In random state 1, there is a large gap between the training accuracy scores and the testing accuracy scores.
#plot the results with Naive Bayes Gaussian
x = NB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, NB_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Gaussian Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
In the first run (random state 0) the running time tends to be stable, but it is in fact higher than in the remaining states.
For Naive Bayes Gaussian:
in the first run the best testing size is 20%, and in the second and third runs the best testing size is the same, 10%.
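`ConfusionMatrixDisplay` was imported above but not used yet; beyond accuracy, a confusion matrix shows which class gets misclassified. A minimal self-contained sketch of the count layout it plots (the helper name is my own; `metrics.confusion_matrix` computes the same table):

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows index the actual class (0/1), columns the predicted class (0/1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix_2x2([0, 1, 1, 0], [0, 1, 0, 0])
print(cm)                        # [[2 0]
                                 #  [1 1]]
print(np.trace(cm) / cm.sum())   # 0.75  (accuracy = correct / total)
```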
# Approach 2: Naive Bayes Bernoulli
# Use the same method as the one with Naive Bayes Gaussian
NBB_train_accuracy = np.zeros([3,19])
NBB_test_accuracy = np.zeros([3,19])
NBB_time = np.zeros([3,19])
NBB_size = np.zeros([3,19])
for i in range(3):
    j = 0.05
    k = 0
    while j < 1:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=j, random_state=i)
        Start_time = time.time() # time both fitting and prediction, as with the other models
        NBayes = naive_bayes.BernoulliNB()
        NBayes.fit(X_train, y_train)
        y_pred_Train = NBayes.predict(X_train)
        y_pred_Test = NBayes.predict(X_test)
        End_time = time.time() #Saving current time
        NBB_size[i,k] = j
        NBB1 = float(round(metrics.accuracy_score(y_train, y_pred_Train), 4))
        NBB_train_accuracy[i,k] = NBB1
        NBB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test), 4))
        NBB_test_accuracy[i,k] = NBB2
        NBB3 = float(round(End_time - Start_time, 6))
        NBB_time[i,k] = NBB3
        j += 0.05
        k += 1
#plot the results with Naive Bayes Bernoulli
x = NBB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, NBB_train_accuracy[i,], label='Training set accuracy', color='navy', marker='*', markersize=8)
    ax[i].plot(x, NBB_test_accuracy[i,], label='Testing set accuracy', color='green', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Accuracy')
    ax[i].set_title('Naive Bayes Bernoulli Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
Compared to the Gaussian variant, Naive Bayes Bernoulli shows greater fluctuation in the training accuracy scores. The lowest recorded value is 77%, in random state 2.
#plot the results with Naive Bayes Bernoulli
x = NBB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
    ax[i].plot(x, NBB_time[i,], label='Running time', color='red', marker='*', markersize=8)
    ax[i].grid()
    ax[i].set_xlabel('Testing size')
    ax[i].set_ylabel('Running time')
    ax[i].set_title('Naive Bayes Bernoulli Model_random state_%i' % i)
    ax[i].legend()
fig.tight_layout(pad=0.5)
plt.show()
The running time of the Naive Bayes Bernoulli algorithm is small, and it varies with both the testing size and the random state.
In random state 0 the running time decreases gradually, while it tends to increase in random state 2 and fluctuates in random state 1.
For Naive Bayes Bernoulli:
In random state 0, the best testing size for Naive Bayes Bernoulli is 5% while the figures for random state 1 and 2 are the same 10%.
# Approach 3: Naive Bayes Complement
# Use the same method as the one with Naive Bayes Gaussian
NBC_train_accuracy = np.zeros([3,19])
NBC_test_accuracy = np.zeros([3,19])
NBC_time = np.zeros([3,19])
NBC_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time (before fitting, consistent with the other variants)
NBayes = naive_bayes.ComplementNB()
NBayes.fit(X_train,y_train)
y_pred_Train = NBayes.predict(X_train)
y_pred_Test = NBayes.predict(X_test)
End_time = time.time() #Saving current time
NBC_size[i,k] = j
NBC1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
NBC_train_accuracy[i,k] = NBC1
NBC2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
NBC_test_accuracy[i,k] = NBC2
NBC3 = float(round(End_time - Start_time,6))
NBC_time[i,k] = NBC3
j+= 0.05
k+= 1
#plot the results with Naive Bayes Complement
x = NBC_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,NBC_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,NBC_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Naive Bayes Complement Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The training accuracy tends to decrease from random state to random state, although it remains high (the lowest value is 71%).
#plot the results with Naive Bayes Complement
x = NBC_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,NBC_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Naive Bayes Complement Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Running time in random state 2 is highly stable compared to other random states.
For Naive Bayes Complement:
In random state 0, it is best to pick a testing size of 5%, while the figures for random states 1 and 2 are 10% and 5% respectively.
#plot comprehensive graphs with Naive Bayes Algorithm
x = NBB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,NB_train_accuracy[i,],label='Gaussian', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,NBB_train_accuracy[i,],label='Bernoulli', color = 'red', marker='*',markersize=8)
ax[i].plot(x,NBC_train_accuracy[i,],label='Complement', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Naive Bayes Algorithm Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
It is clear that Gaussian and Bernoulli produce better and more stable results than Complement.
#plot comprehensive graphs with Naive Bayes Algorithm
x = NBB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,NB_time[i,],label='Gaussian', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,NBB_time[i,],label='Bernoulli', color = 'red', marker='*',markersize=8)
ax[i].plot(x,NBC_time[i,],label='Complement', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Naive Bayes Algorithm Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
It is hard to tell from these graphs which variant is the fastest.
All things considered, when we use the Naive Bayes algorithm on the heart dataset, all three variants give good results (high accuracy and short running time). However, Gaussian and Bernoulli should be rated higher than Complement.
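As a compact version of this comparison, the three variants can be run side by side on a single split; a minimal sketch on synthetic binary features (hypothetical data, not the heart dataset):

```python
import numpy as np
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: binary features, label set by a simple majority rule.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5)).astype(float)
y_toy = (X_toy.sum(axis=1) > 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0)

accs = {}
for model in (naive_bayes.GaussianNB(),
              naive_bayes.BernoulliNB(),
              naive_bayes.ComplementNB()):  # ComplementNB needs non-negative features
    model.fit(X_tr, y_tr)
    accs[type(model).__name__] = accuracy_score(y_te, model.predict(X_te))

for name, acc in accs.items():
    print(name, round(acc, 4))
```

On the heart dataset, the same loop over variants would replace the three separate code cells above.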
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.tree import plot_tree
from sklearn.tree import export_graphviz
from io import StringIO # sklearn.externals.six was removed in recent scikit-learn versions
from IPython.display import Image
import pydotplus
For the classification tree, we care about two parameters: the criterion ('gini' or 'entropy', with 'gini' as the default) and max_depth (int, default=None: the maximum depth of the tree; if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples).
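A minimal sketch of how these two parameters interact, using hypothetical data from make_classification rather than the heart dataset: with max_depth=None the tree grows until its leaves are pure, while an explicit max_depth caps it.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the heart features would be used the same way.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

capped = DecisionTreeClassifier(criterion='gini', max_depth=3,
                                random_state=0).fit(X_toy, y_toy)
unbounded = DecisionTreeClassifier(criterion='entropy',
                                   random_state=0).fit(X_toy, y_toy)

print('capped depth:', capped.get_depth())        # never exceeds 3
print('unbounded depth:', unbounded.get_depth())  # grows until leaves are pure
```

The unbounded tree memorizes the training set (training accuracy 100%), which is exactly the behavior seen in the learning curves below.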
# Approach 1: Decision Tree using Gini
# Build a learning curve for the test size
# max_depth = default
DTG_train_accuracy = np.zeros([3,19])
DTG_test_accuracy = np.zeros([3,19])
DTG_time = np.zeros([3,19])
DTG_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier() #Gini is the default classifier
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTG_size[i,k] = j
DTG1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTG_train_accuracy[i,k] = DTG1
DTG2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTG_test_accuracy[i,k] = DTG2
DTG3 = float(round(End_time - Start_time,6))
DTG_time[i,k] = DTG3
j+= 0.05
k+= 1
#plot the results with Decision Tree using Gini
x = DTG_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTG_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,DTG_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The training set accuracy is always 100% while the testing set accuracy varies. In random state 0, a testing size of 25% is best, while for random states 1 and 2 the figures are 10% and 5% respectively. A testing size of 10% is fairly good across all three random states.
#plot the results with Decision Tree using Gini
x = DTG_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTG_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The running time recorded in random state 1 is the lowest. Also, running the model with a testing size of 10% will take little time.
# Approach 1: Decision Tree using Gini
# Build a learning curve for the max_depth parameter
# choose test size = 10%
DTGM_train_accuracy = np.zeros([3,27])
DTGM_test_accuracy = np.zeros([3,27])
DTGM_time = np.zeros([3,27])
DTGM_size = np.zeros([3,27])
for i in range (3):
j = 1
k = 0
while j <=27:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier(max_depth=j) #Gini is the default classifier
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTGM_size[i,k] = j
DTGM1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTGM_train_accuracy[i,k] = DTGM1
DTGM2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTGM_test_accuracy[i,k] = DTGM2
DTGM3 = float(round(End_time - Start_time,6))
DTGM_time[i,k] = DTGM3
j+= 1
k+= 1
#plot the results with Decision Tree using Gini
x = DTGM_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTGM_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,DTGM_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Max_Depth')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Inferred from the graphs, in random states 0 and 1 the best max_depth is 4, while for random state 2 it is 6. I will choose max_depth = 4.
This gives the following combination for the Decision Tree using Gini: testing size = 10% and max_depth = 4.
#plot the results with Decision Tree using Gini
x = DTGM_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTGM_time[i,],label='Running time', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Max_Depth')
ax[i].set_ylabel('Running time')
ax[i].set_title('Decision Tree using Gini Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The running time of the algorithm using Gini with different Max_Depth varies considerably. With max_depth = 4, the running time is relatively low.
# Decision Tree using Gini with max_depth = 4 with different testing size in different times
# These below results will be used for later comparison
DTGN_train_accuracy = np.zeros([3,19])
DTGN_test_accuracy = np.zeros([3,19])
DTGN_time = np.zeros([3,19])
DTGN_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier(max_depth=4) #Gini is the default classifier
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTGN_size[i,k] = j
DTGN1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTGN_train_accuracy[i,k] = DTGN1
DTGN2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTGN_test_accuracy[i,k] = DTGN2
DTGN3 = float(round(End_time - Start_time,6))
DTGN_time[i,k] = DTGN3
j+= 0.05
k+= 1
# Approach 2: Decision Tree using Entropy
# Build a learning curve for the test size
# max_depth = default
DTE_train_accuracy = np.zeros([3,19])
DTE_test_accuracy = np.zeros([3,19])
DTE_time = np.zeros([3,19])
DTE_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier(criterion='entropy')
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTE_size[i,k] = j
DTE1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTE_train_accuracy[i,k] = DTE1
DTE2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTE_test_accuracy[i,k] = DTE2
DTE3 = float(round(End_time - Start_time,6))
DTE_time[i,k] = DTE3
j+= 0.05
k+= 1
#plot the results with Decision Tree using Entropy
x = DTE_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTE_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,DTE_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Decision Tree using Entropy Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The best testing size for the three random states is 5%, 10% and 5% respectively. It is sensible to choose a shared value of 5% for all the random states.
# Approach 2: Decision Tree using Entropy
# Build a learning curve for the max_depth parameter
# choose test size = 5%
DTEM_train_accuracy = np.zeros([3,27])
DTEM_test_accuracy = np.zeros([3,27])
DTEM_time = np.zeros([3,27])
DTEM_size = np.zeros([3,27])
for i in range (3):
j = 1
k = 0
while j <=27:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier(criterion='entropy',max_depth=j) #Entropy criterion with varying max_depth
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTEM_size[i,k] = j
DTEM1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTEM_train_accuracy[i,k] = DTEM1
DTEM2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTEM_test_accuracy[i,k] = DTEM2
DTEM3 = float(round(End_time - Start_time,6))
DTEM_time[i,k] = DTEM3
j+= 1
k+= 1
#plot the results with Decision Tree using Entropy
x = DTEM_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTEM_train_accuracy[i,],label='Training set accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,DTEM_test_accuracy[i,],label='Testing set accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Decision Tree using Entropy Model_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The suitable max_depth for the three random states is 4, 6 and 12 respectively. A max_depth of 8 will be fairly good across all three.
# Decision Tree using Entropy with max_depth = 8 with different testing size in different times
# These below results will be used for later comparison
DTEY_train_accuracy = np.zeros([3,19])
DTEY_test_accuracy = np.zeros([3,19])
DTEY_time = np.zeros([3,19])
DTEY_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dtree = DecisionTreeClassifier(criterion = 'entropy',max_depth=8)
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
End_time = time.time() #Saving current time
DTEY_size[i,k] = j
DTEY1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
DTEY_train_accuracy[i,k] = DTEY1
DTEY2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
DTEY_test_accuracy[i,k] = DTEY2
DTEY3 = float(round(End_time - Start_time,6))
DTEY_time[i,k] = DTEY3
j+= 0.05
k+= 1
Now we will compare Gini and Entropy in their best combination.
# plot the comparison
x = DTEY_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTGN_train_accuracy[i,],label='Gini Training accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,DTGN_test_accuracy[i,],label='Gini Testing accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,DTEY_train_accuracy[i,],label='Entropy Training accuracy', color = 'yellow', marker='*',markersize=8)
ax[i].plot(x,DTEY_test_accuracy[i,],label='Entropy Testing accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Decision Tree using Gini Entropy_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
It is clear that Entropy brings better results than Gini. Also, at their preferred testing sizes (10% for Gini and 5% for Entropy), models using either criterion predict well.
# plot the comparison
x = DTEY_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTGN_time[i,],label='Gini running time', color = 'red', marker='*',markersize=8)
ax[i].plot(x,DTEY_time[i,],label='Entropy running time', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Decision Tree using Gini Entropy_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The running time of the Entropy model is mostly greater than that of Gini.
In conclusion, if we use a classification tree for the heart dataset, it is better to adopt the Entropy criterion with a testing size of 5% and max_depth = 8.
# plot the tree with the findings:
import os
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.05)
dtree = DecisionTreeClassifier(criterion='entropy', max_depth=8)
dtree.fit(X_train,y_train)
y_pred_Train = dtree.predict(X_train) #Predictions
y_pred_Test = dtree.predict(X_test) #Predictions
dot_data = StringIO()
os.environ['PATH'] = os.environ['PATH']+';'+os.environ['CONDA_PREFIX']+r"\Library\bin\graphviz"
export_graphviz(dtree, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('heart_pruned.png')
Image(graph.create_png())
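As an aside, plot_tree (imported earlier but not used) renders the same tree straight through matplotlib and avoids the Graphviz PATH manipulation; a minimal sketch with hypothetical stand-in data (on the heart dataset you would pass dtree and feature_names=feature_cols instead):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the figure can be saved headlessly
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical stand-in data and model.
X_toy, y_toy = make_classification(n_samples=150, n_features=6, random_state=0)
toy_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                  random_state=0).fit(X_toy, y_toy)

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(toy_tree, filled=True, rounded=True, class_names=['0', '1'], ax=ax)
fig.savefig('tree_plot.png')
```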
In terms of the random forest, there are many parameters to care about, such as:
n_estimators: int,default=100 (the number of trees in the forest)
criterion: {“gini”, “entropy”} default=”gini” (the function to measure the quality of a split)
max_depth: int, default=None (the maximum depth of the tree)
max_features: {“auto”, “sqrt”, “log2”} int or float / default=”auto” (number of features to consider when looking for the best split). "auto" has the same function as "sqrt".
In this project we will focus on the max_features parameter, and find the best testing size to combine with the options of max_features. This work will also compare the results obtained using Gini and Entropy Function in different times. Other parameters will be set to their default values.
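A minimal sketch of the max_features options on hypothetical 13-feature data (matching the number of predictors here): both 'sqrt' and 'log2' limit how many features each split may examine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with 13 features, like the heart predictors.
X_toy, y_toy = make_classification(n_samples=200, n_features=13, random_state=0)

# With 13 features: int(sqrt(13)) = 3 and int(log2(13)) = 3 features per split,
# so the two options are expected to behave very similarly here.
scores = {}
for mf in ('sqrt', 'log2'):
    forest = RandomForestClassifier(n_estimators=100, criterion='gini',
                                    max_features=mf, random_state=0)
    forest.fit(X_toy, y_toy)
    scores[mf] = forest.score(X_toy, y_toy)

print(scores)
```

That the two settings select nearly the same number of features per split at this dimensionality helps explain why the four configurations below give such similar curves.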
from sklearn.ensemble import RandomForestClassifier
# Approach 1: Random Forest using Gini and "sqrt" max_features
# build a learning curve for the testing size
RFGS_train_accuracy = np.zeros([3,19])
RFGS_test_accuracy = np.zeros([3,19])
RFGS_time = np.zeros([3,19])
RFGS_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dForest = RandomForestClassifier(criterion = 'gini',max_features='sqrt')
dForest.fit(X_train,y_train)
y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions
End_time = time.time() #Saving current time
RFGS_size[i,k] = j
RFGS1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
RFGS_train_accuracy[i,k] = RFGS1
RFGS2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
RFGS_test_accuracy[i,k] = RFGS2
RFGS3 = float(round(End_time - Start_time,6))
RFGS_time[i,k] = RFGS3
j+= 0.05
k+= 1
# plot the obtained results:
x = RFGS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFGS_train_accuracy[i,],label='Gini Sqrt Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFGS_test_accuracy[i,],label='Gini Sqrt Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Random Forest using Gini and Sqrt_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The training accuracy is always 100% while the testing accuracy tends to decrease. The best testing size for each random state is 20%, 5% and 5% respectively. A testing size of 10% works well in all three random states.
# Approach 2: Random Forest using Gini and "log2" max_features
# build a learning curve for the testing size
RFGL_train_accuracy = np.zeros([3,19])
RFGL_test_accuracy = np.zeros([3,19])
RFGL_time = np.zeros([3,19])
RFGL_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dForest = RandomForestClassifier(criterion = 'gini',max_features='log2')
dForest.fit(X_train,y_train)
y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions
End_time = time.time() #Saving current time
RFGL_size[i,k] = j
RFGL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
RFGL_train_accuracy[i,k] = RFGL1
RFGL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
RFGL_test_accuracy[i,k] = RFGL2
RFGL3 = float(round(End_time - Start_time,6))
RFGL_time[i,k] = RFGL3
j+= 0.05
k+= 1
# plot the obtained results:
x = RFGL_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFGL_train_accuracy[i,],label='Gini Log2 Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFGL_test_accuracy[i,],label='Gini Log2 Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Random Forest using Gini and Log2_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The best testing size for each random state is 20%, 5% and 5% in turn. A testing size of 10% is extremely good in all the random states.
# Approach 3: Random Forest using Entropy and "sqrt" max_features
# build a learning curve for the testing size
RFES_train_accuracy = np.zeros([3,19])
RFES_test_accuracy = np.zeros([3,19])
RFES_time = np.zeros([3,19])
RFES_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dForest = RandomForestClassifier(criterion = 'entropy',max_features='sqrt')
dForest.fit(X_train,y_train)
y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions
End_time = time.time() #Saving current time
RFES_size[i,k] = j
RFES1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
RFES_train_accuracy[i,k] = RFES1
RFES2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
RFES_test_accuracy[i,k] = RFES2
RFES3 = float(round(End_time - Start_time,6))
RFES_time[i,k] = RFES3
j+= 0.05
k+= 1
# plot the obtained results:
x = RFES_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFES_train_accuracy[i,],label='Entropy sqrt Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFES_test_accuracy[i,],label='Entropy sqrt Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Random Forest using Entropy and sqrt_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The best testing size for each random state is 15%, 10% and 5% in turn. A testing size of 10% is extremely good in all the random states.
# Approach 3: Random Forest using Entropy and "log2" max_features
# build a learning curve for the testing size
RFEL_train_accuracy = np.zeros([3,19])
RFEL_test_accuracy = np.zeros([3,19])
RFEL_time = np.zeros([3,19])
RFEL_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
dForest = RandomForestClassifier(criterion = 'entropy',max_features='log2')
dForest.fit(X_train,y_train)
y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions
End_time = time.time() #Saving current time
RFEL_size[i,k] = j
RFEL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
RFEL_train_accuracy[i,k] = RFEL1
RFEL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
RFEL_test_accuracy[i,k] = RFEL2
RFEL3 = float(round(End_time - Start_time,6))
RFEL_time[i,k] = RFEL3
j+= 0.05
k+= 1
# plot the obtained results:
x = RFEL_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFEL_train_accuracy[i,],label='Entropy log2 Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFEL_test_accuracy[i,],label='Entropy log2 Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Random Forest using Entropy and log2_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The best testing size for each random state is 20%, 10% and 5% in turn. A testing size of 18% is extremely good in all the random states.
# compare all the obtained results
# plot the comparison
x = RFEL_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFGS_test_accuracy[i,],label='Gini sqrt test accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFGL_test_accuracy[i,],label='Gini log2 Test accuracy', color = 'red', marker='*',markersize=8)
ax[i].plot(x,RFES_test_accuracy[i,],label='Entropy sqrt test accuracy', color = 'yellow', marker='*',markersize=8)
ax[i].plot(x,RFEL_test_accuracy[i,],label='Entropy log2 Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Random Forest using Gini and Entropy_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Overall, based on these graphs it is not easy to tell which of the four configurations gives the best results.
# compare all the obtained results
# plot the comparison
x = RFEL_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,RFGS_time[i,],label='Gini sqrt running time', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,RFGL_time[i,],label='Gini log2 running time', color = 'red', marker='*',markersize=8)
ax[i].plot(x,RFES_time[i,],label='Entropy sqrt running time', color = 'yellow', marker='*',markersize=8)
ax[i].plot(x,RFEL_time[i,],label='Entropy log2 running time', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Random Forest using Gini and Entropy_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Similarly, it is hard to separate the running times of these four configurations.
Overall, for the heart dataset, if we only vary the criterion and max_features parameters of the random forest, the results are similar and none stands out, which suggests that at least one important parameter (such as n_estimators or max_depth) is not being tuned here. Even so, all the obtained results are good.
# plot the tree with one option: Entropy and log2
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.18)
dForest = RandomForestClassifier(criterion = 'entropy',max_features='log2')
dForest.fit(X_train,y_train)
y_pred_Train = dForest.predict(X_train) #Predictions
y_pred_Test = dForest.predict(X_test) #Predictions
dot_data = StringIO()
os.environ['PATH'] = os.environ['PATH']+';'+os.environ['CONDA_PREFIX']+r"\Library\bin\graphviz"
export_graphviz(dForest.estimators_[0], out_file=dot_data, # export the first tree of the forest; dtree was left over from the decision tree section
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('heart_forest.png') # separate filename so the decision tree figure is not overwritten
Image(graph.create_png())
In K Nearest Neighbors (KNN), we have to care about one important parameter: n_neighbors.
As with the other algorithms, in this part I will try to find the best combination of testing size and parameter after running the algorithm several times.
First, I will try to obtain the best n_neighbors. It can range between 1 and the number of datapoints in the training set. The dataset has 303 datapoints in total, so I will create a loop for n_neighbors running from 1 to 50. For the test size, I start with three values: 10%, 20% and 40%.
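As a sanity check on this sweep range, a common heuristic (an assumption on my part, not something from the dataset documentation) starts n_neighbors near the square root of the training-set size, which lands comfortably inside the 1-50 window:

```python
import math

n_datapoints = 303                    # total rows in the heart dataset
train_size = int(n_datapoints * 0.9)  # with a 10% testing size -> 272 rows
heuristic_k = round(math.sqrt(train_size))
print(heuristic_k)  # 16, well inside the 1..50 sweep
```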
from sklearn.neighbors import KNeighborsClassifier
# Test size = 10%
# Run the model three times, three random states
KNN_train_accuracy = np.zeros([3,50])
KNN_test_accuracy = np.zeros([3,50])
KNN_time = np.zeros([3,50])
KNN_number = np.zeros([3,50])
for i in range (3):
j = 1
k = 0
while j <=50:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,random_state=i)
Start_time = time.time() #Saving current time
KNN = KNeighborsClassifier(n_neighbors=j)
KNN.fit(X_train,y_train)
y_pred_Train = KNN.predict(X_train) #Predictions
y_pred_Test = KNN.predict(X_test) #Predictions
End_time = time.time() #Saving current time
KNN_number[i,k] = j
KNN1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
KNN_train_accuracy[i,k] = KNN1
KNN2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
KNN_test_accuracy[i,k] = KNN2
KNN3 = float(round(End_time - Start_time,6))
KNN_time[i,k] = KNN3
j+= 1
k+= 1
# plot the obtained results:
x = KNN_number[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNN_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,KNN_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Number of Neighbors')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('KNN_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Based on the graphs, a number of neighbors around 25 with a testing size of 10% works very well.
# Test size = 20%
# Run the model three times, three random states
KNNa_train_accuracy = np.zeros([3,50])
KNNa_test_accuracy = np.zeros([3,50])
KNNa_time = np.zeros([3,50])
KNNa_number = np.zeros([3,50])
for i in range (3):
j = 1
k = 0
while j <=50:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=i)
Start_time = time.time() #Saving current time
KNN = KNeighborsClassifier(n_neighbors=j)
KNN.fit(X_train,y_train)
y_pred_Train = KNN.predict(X_train) #Predictions
y_pred_Test = KNN.predict(X_test) #Predictions
End_time = time.time() #Saving current time
KNNa_number[i,k] = j
KNNa1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
KNNa_train_accuracy[i,k] = KNNa1
KNNa2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
KNNa_test_accuracy[i,k] = KNNa2
KNNa3 = float(round(End_time - Start_time,6))
KNNa_time[i,k] = KNNa3
j+= 1
k+= 1
# plot the obtained results:
x = KNNa_number[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNNa_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,KNNa_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Number of Neighbors')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('KNN_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
In random state 0, n_neighbors around 40 is good while in random state 1 and 2, n_neighbors between 20 and 25 is good.
# Test size = 40%
# Run the model three times, three random states
KNNb_train_accuracy = np.zeros([3,50])
KNNb_test_accuracy = np.zeros([3,50])
KNNb_time = np.zeros([3,50])
KNNb_number = np.zeros([3,50])
for i in range (3):
j = 1
k = 0
while j <=50:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=i)
Start_time = time.time() #Saving current time
KNN = KNeighborsClassifier(n_neighbors=j)
KNN.fit(X_train,y_train)
y_pred_Train = KNN.predict(X_train) #Predictions
y_pred_Test = KNN.predict(X_test) #Predictions
End_time = time.time() #Saving current time
KNNb_number[i,k] = j
KNNb1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
KNNb_train_accuracy[i,k] = KNNb1
KNNb2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
KNNb_test_accuracy[i,k] = KNNb2
KNNb3 = float(round(End_time - Start_time,6))
KNNb_time[i,k] = KNNb3
j+= 1
k+= 1
# plot the obtained results:
x = KNNb_number[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNNb_train_accuracy[i,],label='Train accuracy', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,KNNb_test_accuracy[i,],label='Test accuracy', color = 'green', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Number of Neighbors')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('KNN_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
In random state 0, n_neighbors = 25 is good, while the best values for random states 1 and 2 are 20 and around 3 respectively.
# compare the obtained results with three different testing size
# plot the testing accuracy
x = KNNb_number[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNN_test_accuracy[i,],label='testing size = 0.1', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,KNNa_test_accuracy[i,],label='testing size = 0.2', color = 'red', marker='*',markersize=8)
ax[i].plot(x,KNNb_test_accuracy[i,],label='testing size = 0.4', color = 'yellow', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('Number of Neighbors')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('KNN_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Based on the graphs, a number of neighbors between 20 and 25 is good enough for all three random states. Pick n_neighbors = 23.
# Build a learning curve for the testing size when n_neighbors = 23.
# n_neighbors = 23 => the training set must contain at least 23 datapoints,
# so training size >= 0.076 (23/302) and test size <= 0.924
KNNc_train_accuracy = np.zeros([3,17])
KNNc_test_accuracy = np.zeros([3,17])
KNNc_time = np.zeros([3,17])
KNNc_size = np.zeros([3,17])
for i in range (3):
j = 0.05
k = 0
while j <=0.9:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
KNN = KNeighborsClassifier(n_neighbors=23)
KNN.fit(X_train,y_train)
y_pred_Train = KNN.predict(X_train) #Predictions
y_pred_Test = KNN.predict(X_test) #Predictions
End_time = time.time() #Saving current time
KNNc_size[i,k] = j
KNNc1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
KNNc_train_accuracy[i,k] = KNNc1
KNNc2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
KNNc_test_accuracy[i,k] = KNNc2
KNNc3 = float(round(End_time - Start_time,6))
KNNc_time[i,k] = KNNc3
j+= 0.05
k+= 1
# plot training and testing accuracy against testing size for each random state
x = KNNc_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNNc_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,KNNc_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('KNN with n_neighbors = 23_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Based on the graphs, we can see that a testing size of 0.1 is good enough for prediction in all three random states.
# plot the running time against testing size for each random state
x = KNNc_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,KNNc_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('KNN with n_neighbors = 23_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Random state 1 shows the lowest KNN running time. A testing size of 0.1 makes the algorithm take more time to run than the other testing sizes.
In conclusion, we settle on the combination of testing size = 0.1 (10%) and n_neighbors = 23.
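Putting the recommendation together, the final KNN fit reduces to a few lines. This sketch again uses synthetic data in place of the locally loaded heart CSV, so only the shape of the code, not the score, carries over.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in shaped like the heart data (302 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)

# 10% test split, as recommended above; random_state pins the partition.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

knn = KNeighborsClassifier(n_neighbors=23).fit(X_train, y_train)
acc = knn.score(X_test, y_test)  # mean accuracy on the held-out 10%
print(round(acc, 4))
```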
from sklearn.ensemble import ExtraTreesClassifier
# Build a learning curve for the testing size.
ET_train_accuracy = np.zeros([3,19])
ET_test_accuracy = np.zeros([3,19])
ET_time = np.zeros([3,19])
ET_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
ETrees = ExtraTreesClassifier()
ETrees.fit(X_train,y_train)
y_pred_Train = ETrees.predict(X_train) #Predictions
y_pred_Test = ETrees.predict(X_test) #Predictions
End_time = time.time() #Saving current time
ET_size[i,k] = j
ET1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
ET_train_accuracy[i,k] = ET1
ET2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
ET_test_accuracy[i,k] = ET2
ET3 = float(round(End_time - Start_time,6))
ET_time[i,k] = ET3
j+= 0.05
k+= 1
# plot the results
x = ET_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,ET_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,ET_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Extra Trees_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The training accuracy obtained with Extra Trees is always 100%. The best testing size for each random state is 18%, 10% and 5% respectively; a testing size of 10% is good enough for all three random states.
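A 100% training score from fully grown trees is expected, since each tree can memorize the training set, and it says little about generalization. A quick check, sketched here on synthetic data standing in for the heart features, is to compare it against a cross-validated score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in shaped like the heart data.
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)

et = ExtraTreesClassifier(random_state=0).fit(X_demo, y_demo)
train_acc = et.score(X_demo, y_demo)  # memorized: typically 1.0
cv_acc = cross_val_score(et, X_demo, y_demo, cv=5).mean()  # honest estimate
print(round(train_acc, 4), round(cv_acc, 4))
```

The gap between the two numbers is the more informative quantity than the training score alone.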
# plot the results
x = ET_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,ET_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('running time')
ax[i].set_title('Extra Trees_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The running time in random state 0 is greater than in the other random states. A 10% testing size makes the algorithm run faster than most other testing sizes in random states 0 and 1.
from sklearn.ensemble import GradientBoostingClassifier
# Build a learning curve for the testing size.
GB_train_accuracy = np.zeros([3,19])
GB_test_accuracy = np.zeros([3,19])
GB_time = np.zeros([3,19])
GB_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
GBoost = GradientBoostingClassifier()
GBoost.fit(X_train,y_train)
y_pred_Train = GBoost.predict(X_train) #Predictions
y_pred_Test = GBoost.predict(X_test) #Predictions
End_time = time.time() #Saving current time
GB_size[i,k] = j
GB1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
GB_train_accuracy[i,k] = GB1
GB2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
GB_test_accuracy[i,k] = GB2
GB3 = float(round(End_time - Start_time,6))
GB_time[i,k] = GB3
j+= 0.05
k+= 1
# plot the results
x = GB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,GB_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,GB_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Gradient Boost_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
The training accuracy is 100% most of the time in all three random states. The best testing size for each random state is 30%, 10% and 10% respectively; a 10% testing size is good enough in all three.
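The runs above leave GradientBoostingClassifier at its defaults (100 trees, learning_rate = 0.1, max_depth = 3); those are the main knobs worth sweeping if the near-100% training accuracy is a worry. A hedged GridSearchCV sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in shaped like the heart data.
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)

# A small illustrative grid around the defaults; a real sweep on the
# heart data would likely use wider ranges.
param_grid = {'n_estimators': [50, 100],
              'learning_rate': [0.05, 0.1],
              'max_depth': [2, 3]}
grid_gb = GridSearchCV(GradientBoostingClassifier(random_state=0),
                       param_grid, cv=5, scoring='accuracy')
grid_gb.fit(X_demo, y_demo)
print(grid_gb.best_params_, round(grid_gb.best_score_, 4))
```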
# plot the results
x = GB_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,GB_time[i,],label='running time', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('running time')
ax[i].set_title('Gradient Boost_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
A testing size of 10% makes the algorithm run faster than most other testing sizes in random state 0, but this does not hold in the other random states.
from sklearn.svm import SVC
When it comes to the Support Vector Classifier, the key parameter to consider is the kernel.
kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples). The ‘precomputed’ and callable options are not relevant for the heart dataset, so only the four named kernels are compared.
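SVC is also sensitive to feature scale (the heart columns mix units such as age, cholesterol, and resting blood pressure), so a common precaution is to standardize inside a Pipeline before comparing kernels. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in shaped like the heart data.
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)

# Standardize inside the pipeline so scaling is fit on each training
# fold only, then cross-validate each kernel on equal footing.
scores = {}
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
print({k: round(v, 4) for k, v in scores.items()})
```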
# Approach 1: SVC with linear
# build a learning curve for testing size
SVCL_train_accuracy = np.zeros([3,19])
SVCL_test_accuracy = np.zeros([3,19])
SVCL_time = np.zeros([3,19])
SVCL_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
SVector = SVC(kernel = 'linear')
SVector.fit(X_train,y_train)
y_pred_Train = SVector.predict(X_train) #Predictions
y_pred_Test = SVector.predict(X_test) #Predictions
End_time = time.time() #Saving current time
SVCL_size[i,k] = j
SVCL1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
SVCL_train_accuracy[i,k] = SVCL1
SVCL2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
SVCL_test_accuracy[i,k] = SVCL2
SVCL3 = float(round(End_time - Start_time,6))
SVCL_time[i,k] = SVCL3
j+= 0.05
k+= 1
# plot the results:
x = SVCL_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCL_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCL_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('SVC Linear_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
Training and testing accuracy both look good. A testing size of 5% suits random states 0 and 2, while 10% suits random state 1.
# Approach 2: SVC with poly
# build a learning curve for testing size
SVCP_train_accuracy = np.zeros([3,19])
SVCP_test_accuracy = np.zeros([3,19])
SVCP_time = np.zeros([3,19])
SVCP_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
SVector = SVC(kernel = 'poly')
SVector.fit(X_train,y_train)
y_pred_Train = SVector.predict(X_train) #Predictions
y_pred_Test = SVector.predict(X_test) #Predictions
End_time = time.time() #Saving current time
SVCP_size[i,k] = j
SVCP1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
SVCP_train_accuracy[i,k] = SVCP1
SVCP2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
SVCP_test_accuracy[i,k] = SVCP2
SVCP3 = float(round(End_time - Start_time,6))
SVCP_time[i,k] = SVCP3
j+= 0.05
k+= 1
# plot the results:
x = SVCP_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCP_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCP_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('SVC Poly_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
A testing size of 5% suits random states 0 and 2, while 10% suits random state 1.
# Approach 3: SVC with rbf
# build a learning curve for testing size
SVCR_train_accuracy = np.zeros([3,19])
SVCR_test_accuracy = np.zeros([3,19])
SVCR_time = np.zeros([3,19])
SVCR_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
SVector = SVC(kernel = 'rbf')
SVector.fit(X_train,y_train)
y_pred_Train = SVector.predict(X_train) #Predictions
y_pred_Test = SVector.predict(X_test) #Predictions
End_time = time.time() #Saving current time
SVCR_size[i,k] = j
SVCR1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
SVCR_train_accuracy[i,k] = SVCR1
SVCR2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
SVCR_test_accuracy[i,k] = SVCR2
SVCR3 = float(round(End_time - Start_time,6))
SVCR_time[i,k] = SVCR3
j+= 0.05
k+= 1
# plot the results:
x = SVCR_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCR_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCR_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('SVC RBF_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
A testing size of 5% suits random states 0 and 2, while 10% suits random state 1.
# Approach 4: SVC with sigmoid
# build a learning curve for testing size
SVCS_train_accuracy = np.zeros([3,19])
SVCS_test_accuracy = np.zeros([3,19])
SVCS_time = np.zeros([3,19])
SVCS_size = np.zeros([3,19])
for i in range (3):
j = 0.05
k = 0
while j <=1:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=j,random_state=i)
Start_time = time.time() #Saving current time
SVector = SVC(kernel = 'sigmoid')
SVector.fit(X_train,y_train)
y_pred_Train = SVector.predict(X_train) #Predictions
y_pred_Test = SVector.predict(X_test) #Predictions
End_time = time.time() #Saving current time
SVCS_size[i,k] = j
SVCS1 = float(round(metrics.accuracy_score(y_train,y_pred_Train),4))
SVCS_train_accuracy[i,k] = SVCS1
SVCS2 = float(round(metrics.accuracy_score(y_test, y_pred_Test),4))
SVCS_test_accuracy[i,k] = SVCS2
SVCS3 = float(round(End_time - Start_time,6))
SVCS_time[i,k] = SVCS3
j+= 0.05
k+= 1
# plot the results:
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCS_train_accuracy[i,],label='training accuracy', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCS_test_accuracy[i,],label='testing accuracy', color = 'red', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('SVC Sigmoid_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
A testing size of 25% suits random state 0, while the best values for random states 1 and 2 are 30% and 40% respectively.
# compared the obtained results
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCL_train_accuracy[i,],label='linear', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCP_train_accuracy[i,],label='poly', color = 'red', marker='*',markersize=8)
ax[i].plot(x,SVCR_train_accuracy[i,],label='rbf', color = 'green', marker='*',markersize=8)
ax[i].plot(x,SVCS_train_accuracy[i,],label='sigmoid', color = 'yellow', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Training Accuracy')
ax[i].set_title('SVC with different kernels_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
It is obvious that the 'linear' kernel yields better results than the other options.
# compared the obtained results
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,SVCL_time[i,],label='linear', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,SVCP_time[i,],label='poly', color = 'red', marker='*',markersize=8)
ax[i].plot(x,SVCR_time[i,],label='rbf', color = 'green', marker='*',markersize=8)
ax[i].plot(x,SVCS_time[i,],label='sigmoid', color = 'yellow', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('SVC with different kernels_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
However, the 'linear' kernel slows the algorithm down.
All things considered, SVC performs best on this dataset with the 'linear' kernel. If running time is a concern, the 'poly' kernel is an alternative.
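If the running time of the 'linear' kernel is the main concern, another option worth knowing about is scikit-learn's LinearSVC, a liblinear-based linear SVM that is usually faster than SVC(kernel='linear') on larger datasets, at the cost of a slightly different objective (squared hinge loss by default). A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in shaped like the heart data.
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)

# Both pipelines standardize first; LinearSVC solves the primal problem
# directly (dual=False), which suits n_samples > n_features.
kernel_svc = make_pipeline(StandardScaler(), SVC(kernel='linear'))
linear_svc = make_pipeline(StandardScaler(), LinearSVC(dual=False))

kernel_svc.fit(X_train, y_train)
linear_svc.fit(X_train, y_train)
acc_kernel = kernel_svc.score(X_test, y_test)
acc_linear = linear_svc.score(X_test, y_test)
print(round(acc_kernel, 4), round(acc_linear, 4))
```

On a dataset of only 302 rows the speed difference is small, but the substitution matters if the same pipeline is reused on larger data.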
For the classification problem with the heart dataset, the following algorithms are considered: Logistic Regression, Naïve Bayes, Classification Tree, Random Forest, K Nearest Neighbors, Extra Tree Classifier, Gradient Boost Classifier, Support Vector Classifier.
Overall, all the algorithms perform well on the heart dataset (high accuracy and short running times).
Finding the best testing size for each algorithm is not easy, because the most suitable testing size varies when running the algorithms multiple times (different random states). I make some recommendations in my work.
Some findings for individual algorithms on the heart dataset:
Logistic Regression: performs best with the 'liblinear' solver after scaling the dataset.
Naïve Bayes: Gaussian and Bernoulli perform better than Complement.
Classification Tree: the 'entropy' criterion and max_depth = 8 make the best combination.
Random Forest: it is hard to compare among the methods; some important parameters might not have been considered.
KNN: n_neighbors = 23 yields one of the best results.
Support Vector Classifier: the 'linear' kernel is the best option.
We can divide the algorithms in two groups:
Group 1: Logistic Regression, Naïve Bayes, K Nearest Neighbors.
Group 2: Classification Tree, Random Forest, Extra Tree Classifier, Gradient Boost Classifier, Support Vector Classifier.
The second group contains the more complex algorithms. Now we compare the running time of the algorithms in each group, choosing only the best option for each algorithm.
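The per-group comparison below is based on single splits; the same idea can be condensed into one cross-validated loop over the chosen representatives (Logistic Regression with the 'liblinear' solver, and a Classification Tree with the 'entropy' criterion and max_depth = 8, as found above). A sketch on synthetic stand-in data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in shaped like the heart data.
X_demo, y_demo = make_classification(n_samples=302, n_features=13,
                                     random_state=0)

# One representative per group, configured per the findings above.
models = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Classification Tree': DecisionTreeClassifier(criterion='entropy',
                                                  max_depth=8,
                                                  random_state=0),
}

# For each model: 5-fold mean accuracy plus total cross-validation time.
results = {}
for name, model in models.items():
    start = time.time()
    mean_acc = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    results[name] = (round(mean_acc, 4), round(time.time() - start, 6))
print(results)
```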
Group 1
# plot the training accuracy metrics
a = np.array([[1,1],
[1,1],
[1,1]])
KNNd_train_accuracy = np.append(KNNc_train_accuracy,a,axis=1) # pad because the KNN result matrices have shape [3,17], not [3,19]; the padded points are placeholders, not real accuracies
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,liblinearsc_train_accuracy[i,],label='Logistic Regression', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,NB_train_accuracy[i,],label='Naive Bayes', color = 'red', marker='*',markersize=8)
ax[i].plot(x,KNNd_train_accuracy[i,],label='K nearest neighbors', color = 'cyan', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Different Algorithms_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
# plot the running time
b = np.array([[0,0],
[0,0],
[0,0]])
KNNd_time = np.append(KNNc_time,b,axis=1) # pad because the KNN result matrices have shape [3,17], not [3,19]
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,liblinearsc_time[i,],label='Logistic Regression', color = 'blue', marker='*',markersize=8)
ax[i].plot(x,NB_time[i,],label='Naive Bayes', color = 'red', marker='*',markersize=8)
ax[i].plot(x,KNNd_time[i,],label='K nearest neighbors', color = 'cyan', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Different Algorithms_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
So, Logistic Regression is the best algorithm in Group 1.
Group 2
# plot the training accuracy metrics
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTEY_train_accuracy[i,],label='Classification Tree', color = 'green', marker='*',markersize=8)
ax[i].plot(x,RFGS_train_accuracy[i,],label='Random Forest', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,ET_train_accuracy[i,],label='Extra Trees', color = 'yellow', marker='*',markersize=8)
ax[i].plot(x,GB_train_accuracy[i,],label='Gradient Boost', color = 'black', marker='*',markersize=8)
ax[i].plot(x,SVCL_train_accuracy[i,],label='Support Vector Classifier', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Accuracy')
ax[i].set_title('Different Algorithms_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
# plot the running time
x = SVCS_size[0,]
fig, ax = plt.subplots(ncols=3, figsize=(15,4))
for i in range(3):
ax[i].plot(x,DTEY_time[i,],label='Classification Tree', color = 'green', marker='*',markersize=8)
ax[i].plot(x,RFGS_time[i,],label='Random Forest', color = 'navy', marker='*',markersize=8)
ax[i].plot(x,ET_time[i,],label='Extra Trees', color = 'yellow', marker='*',markersize=8)
ax[i].plot(x,GB_time[i,],label='Gradient Boost', color = 'black', marker='*',markersize=8)
ax[i].plot(x,SVCL_time[i,],label='Support Vector Classifier', color = 'magenta', marker='*',markersize=8)
ax[i].grid()
ax[i].set_xlabel('testing size')
ax[i].set_ylabel('Running time')
ax[i].set_title('Different Algorithms_random state_%i' % i)
ax[i].legend()
fig.tight_layout(pad = 0.5)
plt.show()
It is obvious that the more complex algorithms (group 2) need more time to run than those in group 1. For the heart dataset, it seems we do not need the complex algorithms; the ones in group 1 already produce good results. Within group 2, the Classification Tree is the fastest algorithm.
In conclusion, to recommend the two best algorithms, one per group, I would pick Logistic Regression and Classification Tree.